
    A Dynamic I/O-Efficient Structure for One-Dimensional Top-k Range Reporting

    We present a structure in external memory for "top-k range reporting", which uses linear space, answers a query in O(lg_B n + k/B) I/Os, and supports an update in O(lg_B n) amortized I/Os, where n is the input size and B is the block size. This improves the previous state of the art, which incurs O(lg^2_B n) amortized I/Os per update. Comment: In PODS'1
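    The abstract specifies only the interface of the structure. As a point of reference, the brute-force in-memory sketch below (a hypothetical baseline, not the paper's external-memory structure) spells out one common formulation of the problem: each element has a 1-D key and a weight, and a query asks for the k largest weights among the keys in a range. Details may differ from the paper's exact definition, and none of the I/O bounds above are realized here.

        # Brute-force baseline for one-dimensional top-k range reporting.
        # It only illustrates the query/update semantics; it offers none of
        # the I/O guarantees claimed in the abstract.
        import heapq

        class NaiveTopKRange:
            def __init__(self):
                self.weight = {}                 # key (1-D coordinate) -> weight

            def update(self, key, weight=None):
                # insert or overwrite a point; delete it when weight is None
                if weight is None:
                    self.weight.pop(key, None)
                else:
                    self.weight[key] = weight

            def query(self, lo, hi, k):
                # report the k heaviest points whose coordinates lie in [lo, hi]
                in_range = ((w, x) for x, w in self.weight.items() if lo <= x <= hi)
                return [x for w, x in heapq.nlargest(k, in_range)]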

    A Simple Parallel Algorithm for Natural Joins on Binary Relations


    Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

    In entity matching classification, we are given two sets R and S of objects, where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match. Past research has typically run a learning algorithm directly on all the labeled (i.e., match or not) pairs in R x S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R x S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R| + |S| + |T|, thereby achieving a large performance gain in the (typical) scenario where |T| << |R||S|. This paper provides evidence of the feasibility of the new direction by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a multi-dimensional hyperplane separating the matching pairs from the non-matching ones. We do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a by-product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.
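    To make the setup concrete, the sketch below fixes a made-up feature map phi and shows what a linear classifier over pairs looks like, together with a check that costs only |T| on the known matches; verifying the non-matches naively is exactly the quadratic barrier the paper avoids. None of the paper's MPC algorithms are reproduced here.

        # Illustration of linear classification over pairs (r, s). The feature
        # map phi is a hypothetical example; only the problem setup from the
        # abstract is mirrored, not the paper's algorithms.
        import numpy as np

        def phi(r, s):
            # hypothetical feature map: concatenate the attribute vectors of r and s
            return np.concatenate([r, s])

        def linear_classifier(w, b):
            # f(r, s) = 1 iff (r, s) falls on the positive side of the hyperplane (w, b)
            return lambda r, s: int(np.dot(w, phi(r, s)) >= b)

        def consistent_on_matches(f, T):
            # cost proportional to |T|: every known matching pair must be labeled 1.
            # (Checking the non-matches one by one would cost |R| * |S|, the
            # quadratic cost discussed in the abstract.)
            return all(f(r, s) == 1 for (r, s) in T)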

    Parallel Acyclic Joins with Canonical Edge Covers

    In PODS'21, Hu presented an algorithm in the massively parallel computation (MPC) model that processes any acyclic join with an asymptotically optimal load. In this paper, we present an alternative analysis of her algorithm. The novelty of our analysis lies in the revelation of a new mathematical structure for acyclic hypergraphs, which we name the "canonical edge cover". We prove non-trivial properties of canonical edge covers that offer a graph-theoretic perspective on why Hu's algorithm works. Comment: Accepted to ICDT'2
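    For readers unfamiliar with the underlying notion: an edge cover of a hypergraph is a subset of hyperedges whose union contains every vertex. The check below illustrates only this standard definition on a toy acyclic join; the "canonical" refinement introduced in the paper is not reproduced here.

        # Standard edge-cover check on a hypergraph whose vertices are the
        # attributes of a join and whose hyperedges are the relation schemas.
        # (The paper's canonical edge covers are a special, more structured
        # choice of cover; they are not constructed here.)

        def is_edge_cover(hyperedges, cover):
            # hyperedges: list of sets of vertices; cover: a subset of them
            vertices = set().union(*hyperedges)
            covered = set().union(*cover) if cover else set()
            return vertices <= covered

        # Toy acyclic join R(A,B) join S(B,C) join T(C,D):
        # {R, T} already covers all four attributes, so it is an edge cover.
        R, S, T = {"A", "B"}, {"B", "C"}, {"C", "D"}
        assert is_edge_cover([R, S, T], [R, T])
        assert not is_edge_cover([R, S, T], [S])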

    Distribution-Sensitive Bounds on Relative Approximations of Geometric Ranges

    A family R of ranges and a set X of points, all in R^d, together define a range space (X, R|_X), where R|_X = {X cap h | h in R}. We want to find a structure to estimate the quantity |X cap h|/|X| for any range h in R with the (rho, epsilon)-guarantee: (i) if |X cap h|/|X| > rho, the estimate must have a relative error at most epsilon; (ii) otherwise, the estimate must have an absolute error at most rho * epsilon. The objective is to minimize the size of the structure. Currently, the dominant solution is to compute a relative (rho, epsilon)-approximation, which is a subset of X with O~(lambda/(rho epsilon^2)) points, where lambda is the VC-dimension of (X, R|_X), and O~ hides polylog factors. This paper shows a more general bound sensitive to the content of X. We give a structure that stores O(log(1/rho)) integers plus O~(theta * (lambda/epsilon^2)) points of X, where theta, called the disagreement coefficient, measures how much the ranges differ from each other in their intersections with X. The value of theta is between 1 and 1/rho, such that our space bound is never worse than that of relative (rho, epsilon)-approximations, but we improve the latter's 1/rho term whenever theta = o(1/(rho log(1/rho))). We also prove that, in the worst case, summaries with the (rho, 1/2)-guarantee must consume Omega(theta) words even for d = 2 and lambda <= 3. We then constrain R to be the set of halfspaces in R^d for a constant d, and prove the existence of structures with o(1/(rho epsilon^2)) size offering (rho, epsilon)-guarantees when X is generated from various stochastic distributions. This is the first formal justification of why the term 1/rho is not compulsory for "realistic" inputs.
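    Written out in display form (with est(h) denoting the structure's estimate, a notation introduced only for this restatement), the (rho, epsilon)-guarantee from the two conditions above reads:

        \[
        \left|\, \mathrm{est}(h) - \frac{|X \cap h|}{|X|} \,\right|
        \;\le\;
        \begin{cases}
          \epsilon \cdot \dfrac{|X \cap h|}{|X|}, & \text{if } \dfrac{|X \cap h|}{|X|} > \rho,\\[1ex]
          \rho\,\epsilon, & \text{otherwise.}
        \end{cases}
        \]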

    Range Updates and Range Sum Queries on Multidimensional Points with Monoid Weights

    Let P be a set of n points in R^d where each point p in P carries a weight drawn from a commutative monoid (M, +, 0). Given a d-rectangle r_upd (i.e., an orthogonal rectangle in R^d) and a value Delta in M, a range update adds Delta to the weight of every point p in P cap r_upd; given a d-rectangle r_qry, a range sum query returns the total weight of the points in P cap r_qry. The goal is to store P in a structure to support updates and queries with attractive performance guarantees. We describe a structure of O~(n) space that handles an update in O~(T_upd) time and a query in O~(T_qry) time for arbitrary functions T_upd(n) and T_qry(n) satisfying T_upd * T_qry = n. The result holds for any fixed dimensionality d >= 2. Our query-update tradeoff is tight up to a polylog factor subject to the OMv-conjecture.
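    The brute-force sketch below (a hypothetical baseline, not the paper's structure) only makes the two operations concrete for d = 2 with integer weights under the monoid (Z, +, 0); both operations scan all n points, far from the O~(T_upd)/O~(T_qry) tradeoff stated above.

        # Naive baseline for range updates and range sum queries on 2-D points.
        # Weights live in the monoid (Z, +, 0); every operation takes O(n) time.

        def inside(p, rect):
            # rect = ((lo1, hi1), (lo2, hi2)) is an orthogonal 2-rectangle
            return all(lo <= x <= hi for x, (lo, hi) in zip(p, rect))

        class NaiveRangeSum:
            def __init__(self, points):
                self.weight = {p: 0 for p in points}   # every weight starts at the identity 0

            def range_update(self, r_upd, delta):
                # add delta to the weight of every point in P cap r_upd
                for p in self.weight:
                    if inside(p, r_upd):
                        self.weight[p] += delta

            def range_sum(self, r_qry):
                # total weight of the points in P cap r_qry
                return sum(w for p, w in self.weight.items() if inside(p, r_qry))

        # Example usage:
        ds = NaiveRangeSum([(1, 1), (2, 3), (5, 4)])
        ds.range_update(((0, 3), (0, 3)), 7)      # hits (1,1) and (2,3)
        print(ds.range_sum(((0, 10), (0, 10))))   # -> 14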

    Towards Optimal Dynamic Indexes for Approximate (and Exact) Triangle Counting
